Loading Data from Amazon S3 into PySpark DataFrames using AWS Glue

2024.09.17

Introduction

Hello, I'm Hemanth from the Alliance Department. In this blog, I will walk you through the process of loading data from Amazon S3 into PySpark DataFrames using AWS Glue. This process is essential for anyone working with data pipelines in the cloud, as it combines S3's storage capabilities with PySpark's data processing power in a fully managed environment.

AWS

Amazon Web Services (AWS) is a cloud platform that provides compute capacity, database storage, content delivery, and many other services to help organizations scale. AWS offers a broad range of services across categories such as Compute, Storage, Networking, Database, Management Tools, and Security.

S3

Amazon S3 is a simple and popular AWS storage service. It replicates data across multiple facilities by default, charges on a pay-per-use basis, and is deeply integrated with other AWS services. Buckets are the logical storage units, and objects are the data added to a bucket. Storage classes are set at the object level, which can save money by moving less frequently accessed objects to colder storage.

AWS Glue

AWS Glue is a fully managed ETL (Extract, Transform, Load) service that simplifies preparing and loading data for analytics. With Glue, users can extract data from many sources, transform it with a range of tools, and load it into a destination such as a data lake or data warehouse, streamlining the entire ETL process.

Demo

Navigate to the S3 service in the AWS Management Console. Click on Create Bucket.
Provide a unique name for your bucket and leave the remaining options as default. Click Create Bucket.
Once the bucket is created, upload the files that you intend to load into PySpark.
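If you prefer to script the bucket creation and upload instead of using the console, a minimal boto3 sketch like the one below works as well. The bucket name, region, and file names here are placeholders rather than values from this demo.

```python
import boto3

# Placeholder values - replace with your own bucket name, region, and files
bucket_name = "my-glue-demo-bucket"
region = "ap-northeast-1"

s3 = boto3.client("s3", region_name=region)

# Create the bucket (outside us-east-1, a LocationConstraint is required)
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# Upload the local file you intend to load into PySpark
s3.upload_file("sample_data.csv", bucket_name, "input/sample_data.csv")
```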
Copy the S3 URI of these files for the Glue job later.
Navigate to IAM from the Management Console and click Create Role.
Select Glue as the service that will use this role, and click Next.
Add the necessary permissions, such as the AWS Glue service permissions and access to your S3 bucket.
Then give the role a name and click Create role.
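As a rough programmatic equivalent of these console steps, the role can also be created with boto3. The sketch below assumes the AWS-managed AWSGlueServiceRole policy plus S3 read-only access; the role name and policy choices are illustrative, not necessarily the exact ones used in this demo.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the AWS Glue service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# "glue-s3-demo-role" is a placeholder role name
iam.create_role(
    RoleName="glue-s3-demo-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the Glue service policy and S3 read access (illustrative choices)
iam.attach_role_policy(
    RoleName="glue-s3-demo-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
iam.attach_role_policy(
    RoleName="glue-s3-demo-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
```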
In the AWS Glue dashboard, navigate to ETL Jobs and click on Script Editor.
Click on Create script.
Provide a meaningful name for the job and, in the Job details section, select the IAM role created earlier. Click Save.
Under the Script section, paste the PySpark code that reads data from the S3 bucket and loads it into a DataFrame, using the S3 URI copied earlier. A sketch of such a script is shown below.
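Here is a minimal sketch of what that script could look like. It uses the standard Glue job boilerplate and assumes a CSV file at a placeholder S3 URI; replace the URI with the one you copied earlier and adjust the read options to match your file format.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the S3 file into a PySpark DataFrame
# (placeholder URI - use the S3 URI copied earlier; options assume a CSV with a header row)
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-glue-demo-bucket/input/sample_data.csv")
)

# Verify the load - this output appears in the job's output logs
df.printSchema()
df.show(10)

job.commit()
```

The printSchema() and show() calls are what produce the verification output you will see in the output logs in the next step.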
Click Run, and once the job run succeeds, click Output logs.
The output verifies that the data has been successfully read from the S3 files into PySpark DataFrames.

Conclusion

By following this step-by-step guide, you have successfully learned how to load data from Amazon S3 into PySpark DataFrames using AWS Glue. This method offers a scalable and efficient way to handle large datasets in the cloud, leveraging the powerful combination of S3's storage capabilities and PySpark's data processing engine. AWS Glue simplifies the process, making it easier for data engineers and developers to build and maintain data pipelines.
